07. A Basic Actor-Critic Agent
One important thing to note here is that I sometimes write V(s;\theta_v) or A(s,a), and at other times V_\pi(s;\theta_v) or A_\pi(s,a). (See the \pi there? See the \theta_v? What's going on?)
There are actually two things going on here.
- A very common thing you'll see in reinforcement learning is the oversimplification of notation. Both styles, whether you write A(s,a) or A_\pi(s,a) (value functions with or without the \pi subscript), mean you are evaluating a value function of the policy \pi; in the case of A, the advantage function. A different case is when you see a superscript *. For example, A^*(s,a) means the optimal advantage function, and Q-learning learns the optimal action-value function, Q^*(s,a).
- The other thing is the use of \theta_v in some value functions and not in others. This simply indicates that the value function is approximated by a neural network with parameters \theta_v. For example, V(s;\theta_v) uses a neural network as a function approximator, but A(s,a) does not. We calculate the advantage estimate A(s,a) from the state-value function V(s;\theta_v), so A(s,a) has no parameters of its own and does not use function approximation directly (see the sketch below).
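
To make that last point concrete, here is a minimal PyTorch sketch (the network shape and the names ValueNetwork and advantage_estimate are illustrative assumptions, not part of the lesson). The state-value network carries the parameters \theta_v, while the advantage is computed as a TD-error-style estimate, r + \gamma V(s';\theta_v) - V(s;\theta_v), with no parameters of its own.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """State-value function V(s; theta_v), approximated by a neural network."""
    def __init__(self, state_dim, hidden_dim=64):  # sizes are illustrative assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state)

def advantage_estimate(value_net, state, reward, next_state, gamma=0.99, done=False):
    """A(s, a) ~= r + gamma * V(s'; theta_v) - V(s; theta_v).

    The advantage itself has no parameters of its own; it is computed
    from the outputs of the state-value network (a TD-error-style estimate).
    """
    with torch.no_grad():
        v_s = value_net(state)
        v_next = value_net(next_state)
    return reward + gamma * (1.0 - float(done)) * v_next - v_s

# Usage with made-up shapes, just to show the call pattern:
value_net = ValueNetwork(state_dim=4)
s = torch.rand(1, 4)
s_next = torch.rand(1, 4)
adv = advantage_estimate(value_net, s, reward=1.0, next_state=s_next)
```

In a full actor-critic agent, an estimate like this would then scale the policy-gradient update of the actor's parameters, while the critic's parameters \theta_v are trained on the value-prediction error.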